All of this is new and quite promising. I do not know anything about the subject yet, and learning will probably be challenging; however, I am eager to start. I experienced some difficulties in this first task, as I am unfamiliar with the whole workflow, but in the end I succeeded in overcoming those problems.
I’d like to learn about processing large datasets to be used in my research. My specialization is in political science and organizations, so I would like to learn mainly about working with statistics and with data visualization.
My supervisor warmly recommended that I enrol in this course.
First of all, I read the new data frame with the function read.table().
Then I use the function dim() to show the dimensions of the data frame, which are 166 observations and 7 variables, as explained in the above-mentioned data wrangling section:
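These first steps can be sketched roughly as follows; the file path, separator and data frame name `learning2014` are assumptions for illustration (the name is taken from the model call shown later):

```r
# Read the wrangled data (file path is an assumption for illustration)
learning2014 <- read.table("data/learning2014.txt", sep = "\t", header = TRUE)

# Dimensions: number of rows (observations) and columns (variables)
dim(learning2014)

# Structure: type and first values of each variable
str(learning2014)
```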
## [1] 166 7
The function str() shows the structure of the data frame:
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
To visualize the data, I use the function install.packages() to install the visualization packages ggplot2 and GGally. Then, using the function library(), I load them into the project.
install.packages("ggplot2")
install.packages("GGally")
library(ggplot2)
library(GGally)
The libraries are loaded:
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
Using the fast plotting function pairs(), we can draw a scatterplot matrix containing all the possible pairwise scatterplots of the data frame's columns. Different colors are used for males and females.
This second plot matrix is more advanced, and it is made with ggpairs().
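Both matrices can be produced along these lines; the data frame name `learning2014` is an assumption taken from the model call shown later:

```r
library(ggplot2)
library(GGally)

# Base-R scatterplot matrix of all numeric columns,
# colored by gender (the first column, gender itself, is dropped)
pairs(learning2014[-1], col = learning2014$gender)

# More advanced plot matrix: distributions, scatterplots and
# correlations combined in one figure
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3),
             lower = list(combo = wrap("facethist", bins = 20)))
p
```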
The summary of the variables:
## gender Age Attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf Points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
The females are almost double the number of males, while the males present a wider age range. The summary suggests a notable correlation for surf vs deep and for Points vs Attitude.
I have chosen three explanatory variables, “attitude”, “deep learning” and “surface learning”, with “exam points” as the target variable, to fit a regression model.
Drawing a plot matrix with ggpairs().
Fitting the regression model with three explanatory variables and running the summary:
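A minimal sketch of the fit, matching the Call shown in the output below; the model object name `my_model` is an assumption:

```r
# Multiple regression: exam points explained by attitude and
# the deep/surface learning scores
my_model <- lm(Points ~ Attitude + deep + surf, data = learning2014)
summary(my_model)
```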
##
## Call:
## lm(formula = Points ~ Attitude + deep + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9168 -3.1487 0.3667 3.8326 11.3519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.3551 4.7124 3.895 0.000143 ***
## Attitude 3.4661 0.5766 6.011 1.18e-08 ***
## deep -0.9485 0.7903 -1.200 0.231815
## surf -1.0911 0.8360 -1.305 0.193669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.313 on 162 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1876
## F-statistic: 13.7 on 3 and 162 DF, p-value: 5.217e-08
The adjusted R-squared of 0.1876 indicates that the model fits the data rather poorly: the explanatory variables account for less than a fifth of the variance in exam points. Only Attitude is statistically significant.
A multiple linear regression model has a few assumptions:

- a linear relationship between the target variable and the explanatory variables, usually revealed by scatterplots;
- multivariate normality: the residuals are normally distributed, which a Q-Q plot can reveal;
- the absence of multicollinearity, in other words, the explanatory variables are not highly correlated with each other;
- homoscedasticity, or constant variance of errors: the error terms have a similar variance across the values of the explanatory variables. A plot of standardized residuals versus predicted values shows whether the points are equally distributed across all values of the dependent variable.
The diagnostic plots delivered the following observations:
The residuals vs fitted values plot is used to check the assumption of a linear relationship. A horizontal line without distinct patterns indicates a linear relationship; in this case, the red line is more or less horizontal at zero, hence a linear relationship can be assumed.
The normal Q-Q plot reveals whether the residuals are normally distributed. A good indication is that the residual points follow the straight dashed line. For the majority of points this is the case here, hence normality can also be assumed.
The residuals vs leverage plot identifies the impact of single observations on the model. Influential points lie at the upper or lower right corner, in a position where they can pull the regression line. In this case the points are on the left side of the plot, thus we can say that no observation has problematic leverage.
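The diagnostic plots discussed above can be drawn directly from the fitted model object (here assumed to be named `my_model`):

```r
# which = c(1, 2, 5) selects Residuals vs Fitted, Normal Q-Q
# and Residuals vs Leverage
par(mfrow = c(2, 2))
plot(my_model, which = c(1, 2, 5))
```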
In this chapter, we will analyze a dataset that resulted from the data wrangling of another dataset about high school students’ performance in mathematics and Portuguese language. Hereinafter are the variables of the dataset we are going to analyze:
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
Altogether the dataset has a dimension of:
## [1] 382 35
I choose four variables of interest, on which I build the following hypotheses:
| VARIABLE CHOSEN | RELATED HYPOTHESIS |
|---|---|
| "goout" | H1: Students who go out more have higher alcohol consumption. |
| "freetime" | H2: Students who have more free time are more prone to drink. |
| "studytime" | H3: The more the studytime, the less a student drinks. |
| "romantic" | H4: Given the courting dynamics, romantics drink less. |
Let’s start with a graphical overview of the distributions of the dataset's variables in relation to alcohol consumption. After installing and loading "tidyr", "dplyr" and "ggplot2", and piping the data with the operator %>% into the plot-generating function ggplot(), we get the following plot:
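A sketch of one common way to draw such an overview, using gather() to reshape the data before plotting; the dataset name `alc` is taken from the model call later in this chapter:

```r
library(tidyr)
library(dplyr)
library(ggplot2)

# One bar plot per variable, each on its own facet
gather(alc) %>%
  ggplot(aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar()
```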
Let’s now focus on the cross-tabulations for the specific interest variables on which hypotheses were posed:
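The cross-tabulations below can be produced with group_by() and summarise(); using the final grade G3 for `mean_grade` is my assumption based on the variable list above:

```r
library(dplyr)

# Counts and mean final grade for each combination of 'goout'
# and alcohol use (repeat for the other three variables)
alc %>%
  group_by(goout, alc_use) %>%
  summarise(count = n(), mean_grade = mean(G3))
```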
## # A tibble: 38 x 4
## # Groups: goout [5]
## goout alc_use count mean_grade
## <int> <dbl> <int> <dbl>
## 1 1 1 14 11.1
## 2 1 1.5 3 7
## 3 1 2 2 13.5
## 4 1 2.5 2 12
## 5 1 3.5 1 10
## 6 2 1 49 12.6
## 7 2 1.5 21 12.1
## 8 2 2 14 10.3
## 9 2 2.5 8 12.4
## 10 2 3 4 10.8
## # ... with 28 more rows
## # A tibble: 37 x 4
## # Groups: freetime [5]
## freetime alc_use count mean_grade
## <int> <dbl> <int> <dbl>
## 1 1 1 10 10.3
## 2 1 1.5 1 16
## 3 1 2 4 12.2
## 4 1 3 1 10
## 5 1 4 1 8
## 6 2 1 18 13.3
## 7 2 1.5 24 12.9
## 8 2 2 7 10.3
## 9 2 2.5 8 11.8
## 10 2 3 6 10.7
## # ... with 27 more rows
## # A tibble: 30 x 4
## # Groups: studytime [4]
## studytime alc_use count mean_grade
## <int> <dbl> <int> <dbl>
## 1 1 1 21 12.2
## 2 1 1.5 20 10
## 3 1 2 17 10.1
## 4 1 2.5 10 11.5
## 5 1 3 12 10.1
## 6 1 3.5 11 9.09
## 7 1 4 4 13
## 8 1 5 5 9.6
## 9 2 1 72 11.7
## 10 2 1.5 35 12.1
## # ... with 20 more rows
## # A tibble: 18 x 4
## # Groups: romantic [2]
## romantic alc_use count mean_grade
## <fct> <dbl> <int> <dbl>
## 1 no 1 98 12.3
## 2 no 1.5 47 11.8
## 3 no 2 35 11.3
## 4 no 2.5 32 11.8
## 5 no 3 25 10.4
## 6 no 3.5 13 11.1
## 7 no 4 6 10.3
## 8 no 4.5 1 10
## 9 no 5 4 10
## 10 yes 1 42 11.1
## 11 yes 1.5 22 11.2
## 12 yes 2 24 11.2
## 13 yes 2.5 12 11.7
## 14 yes 3 7 8.71
## 15 yes 3.5 4 7.5
## 16 yes 4 3 10
## 17 yes 4.5 2 11
## 18 yes 5 5 11.6
By using logistic regression we statistically explore the relationship between the four selected variables and the binary high/low alcohol consumption variable as the target variable.
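A minimal sketch of the fit, matching the Call in the output below; the model object name `m` is an assumption:

```r
# Logistic regression with the binary high_use as target
m <- glm(high_use ~ goout + freetime + studytime + romantic,
         data = alc, family = "binomial")
summary(m)
```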
##
## Call:
## glm(formula = high_use ~ goout + freetime + studytime + romantic,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6927 -0.7743 -0.5479 1.0022 2.6060
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.31564 0.62593 -3.700 0.000216 ***
## goout 0.72871 0.12071 6.037 1.57e-09 ***
## freetime 0.08366 0.13415 0.624 0.532863
## studytime -0.58629 0.16508 -3.552 0.000383 ***
## romanticyes -0.18301 0.26722 -0.685 0.493442
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 399.98 on 377 degrees of freedom
## AIC: 409.98
##
## Number of Fisher Scoring iterations: 4
## (Intercept) goout freetime studytime romanticyes
## -2.31564309 0.72871166 0.08366382 -0.58628767 -0.18300551
The odds ratio (OR) is obtained by dividing the odds of “success” (Y = 1) for students who have the property X by the odds of “success” for those who do not. The OR quantifies the relationship between X and Y: an odds ratio higher than 1 indicates that X is positively associated with “success”. The odds ratios can also be obtained as the exponents of the coefficients of a logistic regression model.
Computation of the odds ratio (OR)
Computation of the confidence intervals for the coefficients by the function confint(), and exponentiation of the values by using exp().
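These two steps can be sketched as follows, assuming the fitted model object is named `m`; confint() prints the "Waiting for profiling to be done..." message seen below:

```r
# Odds ratios: exponents of the model coefficients
OR <- exp(coef(m))

# Profile-likelihood confidence intervals, also exponentiated
CI <- exp(confint(m))

# Combine odds ratios and their confidence intervals in one table
cbind(OR, CI)
```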
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -3.5691044 -1.1092870
## goout 0.4982923 0.9725916
## freetime -0.1795761 0.3477794
## studytime -0.9207772 -0.2716715
## romanticyes -0.7148612 0.3353712
Obtaining the odds ratio with their confidence intervals by using cbind():
## OR 2.5 % 97.5 %
## (Intercept) 0.09870269 0.02818108 0.3297940
## goout 2.07240892 1.64590816 2.6447898
## freetime 1.08726331 0.83562438 1.4159199
## studytime 0.55638895 0.39820941 0.7621045
## romanticyes 0.83276356 0.48926002 1.3984594
Values bigger than 1 are seen across the whole interval for goout, for freetime (except for the 2.5 % bound), and for the 97.5 % bound of romantic; there the association is positive. These results mostly confirmed my hypotheses, apart from studytime.
First we use the function predict() to estimate the probability of high use; after adding these new columns to the dataset 'alc', we move on to predict probabilities and classes, and to tabulate the target variable versus the predictions:
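A sketch of this step, assuming the fitted model object is named `m`; the 0.5 cut-off for the TRUE/FALSE prediction is the conventional choice:

```r
library(dplyr)

# Predicted probability of high use for each student
probabilities <- predict(m, type = "response")

# Add the probabilities and a TRUE/FALSE prediction to the data
alc <- mutate(alc,
              probability = probabilities,
              prediction  = probability > 0.5)

# 2x2 cross tabulation of the target versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
```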
## goout freetime studytime romantic probability prediction
## 373 2 3 1 no 0.23263069 FALSE
## 374 3 4 1 no 0.40585185 FALSE
## 375 3 3 3 no 0.16282192 FALSE
## 376 3 4 1 yes 0.36258869 FALSE
## 377 2 4 3 no 0.09258880 FALSE
## 378 4 3 2 no 0.42009575 FALSE
## 379 2 2 2 no 0.13429940 FALSE
## 380 1 1 2 no 0.06441395 FALSE
## 381 5 4 1 no 0.74578989 TRUE
## 382 1 4 1 no 0.13722123 FALSE
## prediction
## high_use FALSE TRUE
## FALSE 247 21
## TRUE 73 41
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64659686 0.05497382 0.70157068
## TRUE 0.19109948 0.10732984 0.29842932
## Sum 0.83769634 0.16230366 1.00000000
Accuracy measures the performance of a binary classifier as the proportion of correctly classified observations; conversely, the mean of incorrectly classified observations can be seen as a penalty (loss) function of the classifier: the lower, the better. In this section we first define a loss function loss_func(), and then apply it to probability = 0, probability = 1, and to the prediction probabilities in alc.
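A sketch of the loss function and the three calls; the column names `high_use` and `probability` follow the tabulation above:

```r
# Mean of incorrectly classified observations: a prediction is wrong
# when the probability is on the other side of 0.5 from the class
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Penalty when always guessing FALSE, always guessing TRUE,
# and when using the model's predicted probabilities
loss_func(class = alc$high_use, prob = 0)
loss_func(class = alc$high_use, prob = 1)
loss_func(class = alc$high_use, prob = alc$probability)
```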
## [1] 0.2984293
## [1] 0.7015707
## [1] 0.2460733
The first and third calls deliver better results than the case probability = 1: the model works better than guessing.
Cross-validation is a technique to assess how the results of a statistical analysis generalize to an independent data set. In cross-validation, a sample of data is partitioned into complementary subsets (a larger training set and a smaller testing set); the analysis is performed on the former and the results are validated on the latter. Here K = 10 subsets are used.
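A sketch using cv.glm() from the boot package, assuming the model `m` and the loss function `loss_func` defined above:

```r
library(boot)

# 10-fold cross-validation; cost = proportion of misclassifications
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]

# Leave-one-out cross-validation: K equals the number of observations
cv_loo <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))
cv_loo$delta[1]
```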
## [1] 0.2460733
With leave-one-out cross validation:
## [1] 0.2460733
with 10-fold cross validation:
## [1] 0.2539267
The ten-fold cross-validation shows a higher prediction error on the testing data than on the training data. It is nevertheless lower than the 0.26 obtained in the DataCamp exercise.
At first I use a logistic regression model with 22 predictors.
## [1] 0.2460733
The function is performed with leave-one-out cross validation.
## [1] 0.2539267
Here the result is given by ten-fold cross validation.
## [1] 0.2539267
With 15 predictors
The function is performed with leave-one-out cross validation.
## [1] 0.2696335
Here the result is given by ten-fold cross validation.
## [1] 0.2617801
With 10 predictors
The function is performed with leave-one-out cross validation.
## [1] 0.2670157
Here the result is given by ten-fold cross validation.
## [1] 0.2748691
With 5 predictors
The function is performed with leave-one-out cross validation.
## [1] 0.2460733
Here the result is given by ten-fold cross validation.
## [1] 0.2460733
The Boston dataset describes housing values in the suburbs of the city of the same name. I use the functions str() and dim() to explore the dataset. Here is its structure:
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
And here its dimension:
## [1] 506 14
Let’s have a look at the summary() of the variables:
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Using the function pairs() we obtain the following graphical overview:
From the plot above it is a bit difficult to see relations between the variables, so let’s try something else, for instance a correlation plot. Using the function corrplot() we can obtain a visual way to look at correlations. First we need to calculate the correlation matrix using cor():
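Both steps can be sketched as follows; the Boston data ships with the MASS package, and order = "hclust" is the ordering mentioned in the discussion below:

```r
library(MASS)      # provides the Boston dataset
library(corrplot)

data("Boston")

# Correlation matrix, rounded to two digits
cor_matrix <- round(cor(Boston), 2)

# Visualize; hclust ordering groups correlated variables together
corrplot(cor_matrix, method = "circle", type = "upper",
         order = "hclust", tl.cex = 0.8)
```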
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
Now that we have the matrix, rounded to two digits, we can proceed to create the correlation plot using corrplot(). Here is how it looks:
corrplot() provides us with a graphical overview of the Pearson correlation coefficients calculated with cor(). This measure summarizes the strength and direction of a linear association between two variables: a value of 0 means that the two variables are uncorrelated, while a value of -1 (in red) or 1 (in blue) shows that they are perfectly linearly related.
As we can see, the size and colour intensity of the dots visually show the strength of the linear associations. I used order = "hclust" as the ordering method for this correlation matrix, as it makes the matrix more immediate to read. Among the strongest negative correlations are: dis vs nox, dis vs indus, dis vs age, lstat vs rm, and lstat vs medv. Among the strongest positive correlations we find: tax vs rad, tax vs indus, nox vs indus, and nox vs age. Only the variable chas seems to have little if any correlation with the others.
We scale the dataset using the scale() function; then we can inspect the scaled variables with summary():
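A sketch of this step; the object name `boston_scaled` matches the class check further below:

```r
# Center and scale every column: (x - mean(x)) / sd(x)
boston_scaled <- scale(Boston)
summary(boston_scaled)

# scale() returns a matrix, so convert it back to a data frame
boston_scaled <- as.data.frame(boston_scaled)
```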
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
The function scale() standardized the variables by subtracting the column means from the corresponding columns and dividing the differences by the standard deviations. Here it was possible to scale the whole dataset, as it contains only numerical values.
The class of the boston_scaled object is:
## [1] "matrix"
so, to complete the procedure, we convert the object into a data frame.
To create the categorical variable, we use the function cut() together with quantile() to obtain a factor variable divided by quantiles, which gives four rates of crime:
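A sketch of this step; the class labels follow the table below, and replacing the continuous `crim` with the categorical `crime` is my assumption about the intended procedure:

```r
# Quantile break points of the scaled crime rate
bins <- quantile(boston_scaled$crim)

# Categorical crime variable with four classes
crime <- cut(boston_scaled$crim, breaks = bins,
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)

# Replace the continuous crim with the categorical version
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
```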
## crime
## low med_low med_high high
## 127 126 126 127
Train and test sets. At first we use nrow() to count the number of rows in the dataset:
## [1] 506
then with ind <- sample() we randomly choose 80% of them to create the train set. With the remaining rows we create the test set.
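A sketch of the split; the 80/20 proportion follows the text:

```r
n <- nrow(boston_scaled)

# Randomly pick 80 % of the row indices for training
ind <- sample(n, size = n * 0.8)

train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
```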
In this section we fit a linear discriminant analysis on the train set, using the categorical crime rate as the target variable and all the other variables as predictors. Here we can see the plot:
We will now run a LDA model on the test data, but before that we will save the crime categories from the test set and then we will remove the categorical crime variable from the test dataset.
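These steps can be sketched as follows, assuming the train/test objects from the split above:

```r
library(MASS)

# LDA with the categorical crime rate as target, all others as predictors
lda.fit <- lda(crime ~ ., data = train)

# Save the true classes, then drop them from the test data
correct_classes <- test$crime
test <- dplyr::select(test, -crime)

# Predict on the test set and cross-tabulate
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
```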
Here is the cross tabulation of the results with the crime categories from the test set:
## predicted
## correct low med_low med_high high
## low 12 8 1 0
## med_low 8 17 5 0
## med_high 0 5 14 2
## high 0 0 0 30
To measure the distances between the observations, we first standardize the dataset using data.Normalization() with type = "n1".
Then we compute the distances between observations using the function dist(), which by default uses the Euclidean distance, the most common distance measure; we also use the Manhattan method:
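A sketch, assuming `boston_scaled` here is the freshly standardized, all-numeric version of the Boston data used for clustering:

```r
# Euclidean (the default) and Manhattan distance matrices
dist_eu  <- dist(boston_scaled)
dist_man <- dist(boston_scaled, method = "manhattan")

summary(dist_eu)
summary(dist_man)
```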
After that, we calculate and visualize the total within-cluster sum of squares, using set.seed(123) to make the random choice of the initial cluster centres reproducible, and setting the maximum number of clusters at 10.
Using the elbow method, I choose to go with three centres.
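The elbow search and the chosen k-means run can be sketched as follows, again assuming an all-numeric `boston_scaled`:

```r
set.seed(123)  # fixed seed so the random initial centres are reproducible

# Total within-cluster sum of squares for 1..10 clusters
k_max <- 10
twcss <- sapply(1:k_max, function(k) kmeans(boston_scaled, k)$tot.withinss)

# The "elbow" where the curve flattens suggests the number of clusters
plot(1:k_max, twcss, type = "b")

# k-means with the chosen three centres, plotted on a column subset
km <- kmeans(boston_scaled, centers = 3)
pairs(boston_scaled[1:7], col = km$cluster)
```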
Then we run kmeans(); I divide the plot into four parts to improve clarity:
To be sure, I also try something different, for instance five centres.
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Here, as before, is the procedure for k-means with more than two clusters:
Again, we run kmeans():
And here is the LDA model; since the variable chas appeared to be constant within groups, I removed it:
The most influential variable as cluster linear separator is the variable tax.
In this section we run the code on the train scaled dataset to produce two 3D-plots.
In this first 3D-plot the color is given by the train$crime:
In this second 3D-plot the color is defined by the km$cluster:
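A sketch of how such 3D plots can be drawn with plotly, assuming the `train`, `lda.fit` and `km` objects from the steps above; the LD column names are produced by the LDA scaling:

```r
library(plotly)

# Matrix product of the training predictors and the LDA scaling
# gives the coordinates in the discriminant space
model_predictors <- dplyr::select(train, -crime)
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

# Colour first by the true crime classes, then by the k-means clusters
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2,
        z = matrix_product$LD3, type = "scatter3d",
        mode = "markers", color = train$crime)
```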
Comments on the findings and comparison of the results against the previous hypotheses.
Overall, for the chosen variables, only the females seem to present outliers. As for the whiskers, the females vary more than the males, except for the variable romantic. The only skewed plot is the romantic males’ one. Let’s now proceed to the comparison with the hypotheses.

"goout" and H1. Concerning the variable "goout", I hypothesized (H1) that students who go out more experience higher alcohol consumption. Overall, it seems that people who go out more do have higher consumption of alcohol. The most striking differences are between classes 1 and 3, but people at 2 have higher consumption than people at 5, and the highest consumption is registered at 3. So it is not self-evident that the more a student goes out, the more they drink: hypothesis H1 is not entirely correct.

"freetime" and H2. In regard to the variable "freetime", its related hypothesis (H2) was that students who have more free time are more prone to drink. Again, the levels for answers 3 and 4 are much higher than for 1, 2 and 5. On a general level, hypothesis H2 seems correct; but the distribution shows a stark decrease at 5, which has even lower levels than at 2.

"studytime" and H3. The findings for the variable "studytime" seem to corroborate the hypothesis (H3) according to which the more the study time, the less a student drinks. The value at 2 is higher than at 1, but consumption decreases as study time increases.

"romantic" and H4. Finally, to the variable "romantic" I associated the hypothesis (H4) that romantics drink less than non-romantics, due to courting dynamics. The results seem to confirm H4: non-romantics drink as much as twice the amount of romantics, and the male percentage is higher than the female one in this category, with a slightly larger gap among the non-romantics.